both on expectation and standard deviation compared to the full-precision baseline and the ternary model. For instance, the top-1 eigenvalues of MHA-O in the binary model are ∼15× larger than those of the full-precision counterpart. Therefore, the quantization loss increases of the full-precision and ternary models are more tightly bounded than that of the binary model in Eq. (5.19). The highly complex and irregular landscape induced by binarization thus poses more challenges to the optimization.
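One hedged way to reproduce such a curvature measurement is power iteration with Hessian-vector products. The sketch below assumes a PyTorch loss that has already been computed on a batch for the parameters of interest (e.g., the MHA-O weights mentioned above); the helper name `top_hessian_eigenvalue` and the overall setup are illustrative assumptions, not the authors' code.

```python
import torch

def top_hessian_eigenvalue(loss, params, iters=20):
    """Estimate the top-1 eigenvalue of the loss Hessian w.r.t. `params`
    via power iteration on Hessian-vector products (illustrative sketch)."""
    # First-order gradients with a retained graph so we can differentiate again.
    grads = torch.autograd.grad(loss, params, create_graph=True)
    # Random unit start vector, one block per parameter tensor.
    v = [torch.randn_like(p) for p in params]
    norm = torch.sqrt(sum((x * x).sum() for x in v))
    v = [x / norm for x in v]
    eig = 0.0
    for _ in range(iters):
        # Hessian-vector product: differentiating g·v w.r.t. the weights gives H v.
        gv = sum((g * x).sum() for g, x in zip(grads, v))
        hv = torch.autograd.grad(gv, params, retain_graph=True)
        # Rayleigh quotient v^T H v (v has unit norm) is the current estimate.
        eig = sum((h * x).sum() for h, x in zip(hv, v)).item()
        norm = torch.sqrt(sum((h * h).sum() for h in hv))
        v = [h / (norm + 1e-12) for h in hv]
    return eig
```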
5.7.1 Ternary Weight Splitting
Given the challenging loss landscape of binary BERT, the authors proposed ternary weight splitting (TWS), which exploits the flatness of the ternary loss landscape as an optimization proxy for the binary model. As shown in Fig. 2.4, they first train a half-sized ternary BERT to convergence, and then split both the latent full-precision weight $\mathbf{W}^t$ and the quantized $\hat{\mathbf{W}}^t$ into their binary counterparts $\mathbf{W}^b_1, \mathbf{W}^b_2$ and $\hat{\mathbf{W}}^b_1, \hat{\mathbf{W}}^b_2$ via the TWS operator. To inherit the performance of the ternary model after splitting, the TWS operator requires the splitting equivalency (i.e., the same output given the same input):
$$
\mathbf{W}^t = \mathbf{W}^b_1 + \mathbf{W}^b_2, \qquad \hat{\mathbf{W}}^t = \hat{\mathbf{W}}^b_1 + \hat{\mathbf{W}}^b_2. \tag{5.20}
$$
While the solution to Eq. (5.20) is not unique, the latent full-precision weights $\mathbf{W}^b_1, \mathbf{W}^b_2$ are constrained after splitting to satisfy $\mathbf{W}^t = \mathbf{W}^b_1 + \mathbf{W}^b_2$ as
$$
\mathbf{W}^b_{1,i} =
\begin{cases}
a \cdot \mathbf{W}^t_i, & \text{if } \hat{\mathbf{W}}^t_i \neq 0 \\
b + \mathbf{W}^t_i, & \text{if } \hat{\mathbf{W}}^t_i = 0,\ \mathbf{W}^t_i > 0 \\
b, & \text{otherwise}
\end{cases}
\tag{5.21}
$$
$$
\mathbf{W}^b_{2,i} =
\begin{cases}
(1-a) \cdot \mathbf{W}^t_i, & \text{if } \hat{\mathbf{W}}^t_i \neq 0 \\
-b, & \text{if } \hat{\mathbf{W}}^t_i = 0,\ \mathbf{W}^t_i > 0 \\
-b + \mathbf{W}^t_i, & \text{otherwise}
\end{cases}
\tag{5.22}
$$
where $a$ and $b$ are the variables to solve. By Eq. (5.21) and Eq. (5.22) with $\hat{\mathbf{W}}^t = \hat{\mathbf{W}}^b_1 + \hat{\mathbf{W}}^b_2$, we get
$$
a = \frac{\sum_{i \in \mathcal{I}} |\mathbf{W}^t_i| + \sum_{j \in \mathcal{J}} |\mathbf{W}^t_j| - \sum_{k \in \mathcal{K}} |\mathbf{W}^t_k|}{2\sum_{i \in \mathcal{I}} |\mathbf{W}^t_i|}, \qquad
b = \frac{\frac{n}{|\mathcal{I}|}\sum_{i \in \mathcal{I}} |\mathbf{W}^t_i| - \sum_{i=1}^{n} |\mathbf{W}^t_i|}{2(|\mathcal{J}| + |\mathcal{K}|)},
\tag{5.23}
$$
where we denote $\mathcal{I} = \{i \mid \hat{\mathbf{W}}^t_i \neq 0\}$, $\mathcal{J} = \{j \mid \hat{\mathbf{W}}^t_j = 0 \text{ and } \mathbf{W}^t_j > 0\}$, and $\mathcal{K} = \{k \mid \hat{\mathbf{W}}^t_k = 0 \text{ and } \mathbf{W}^t_k < 0\}$, and $|\cdot|$ denotes the cardinality of a set.
5.7.2 Knowledge Distillation
Further, the authors proposed to boost the performance of binarized BERT by Knowledge Distillation (KD), which is shown to benefit BERT quantization [285]. Following [106, 285], they first performed intermediate-layer distillation from the full-precision teacher network's embedding $\mathbf{E}$, layer-wise MHA output $\mathbf{M}_l$, and FFN output $\mathbf{F}_l$ to the quantized student counterparts $\hat{\mathbf{E}}, \hat{\mathbf{M}}_l, \hat{\mathbf{F}}_l$ ($l = 1, 2, \ldots, L$). Minimizing their mean squared errors, i.e., $\ell_{\mathrm{emb}} = \mathrm{MSE}(\hat{\mathbf{E}}, \mathbf{E})$, $\ell_{\mathrm{mha}} = \sum_l \mathrm{MSE}(\hat{\mathbf{M}}_l, \mathbf{M}_l)$, and $\ell_{\mathrm{ffn}} = \sum_l \mathrm{MSE}(\hat{\mathbf{F}}_l, \mathbf{F}_l)$, the objective is
$$
\ell_{\mathrm{int}} = \ell_{\mathrm{emb}} + \ell_{\mathrm{mha}} + \ell_{\mathrm{ffn}}. \tag{5.24}
$$
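As a concrete illustration of Eq. (5.24), the following PyTorch sketch computes the intermediate-layer distillation objective from cached teacher and student activations. The dictionary layout and the helper name `intermediate_distillation_loss` are assumptions made for illustration, not the authors' training code.

```python
import torch.nn.functional as F

def intermediate_distillation_loss(student, teacher):
    """Intermediate-layer distillation loss of Eq. (5.24).

    `student` and `teacher` are dicts holding:
      'emb': embedding output tensor,
      'mha': list of per-layer MHA output tensors (length L),
      'ffn': list of per-layer FFN output tensors (length L).
    """
    loss_emb = F.mse_loss(student['emb'], teacher['emb'])
    loss_mha = sum(F.mse_loss(s, t) for s, t in zip(student['mha'], teacher['mha']))
    loss_ffn = sum(F.mse_loss(s, t) for s, t in zip(student['ffn'], teacher['ffn']))
    return loss_emb + loss_mha + loss_ffn
```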